RANDOM FOREST¶

Ensemble Learning technique¶

Table of contents¶

  1. Bagging and Boosting
  2. Random Forest
  3. Randomness in Random Forest
  4. Bootstrapping
  5. Criteria Selection
  6. Estimators
  7. Comparison of accuracy with and without considering out-of-bag error
  8. Using cross-validation to get the best parameters
  9. Limitations of random forest

Ensemble Learning¶

Ensemble methods are techniques that create multiple models and then combine them to produce improved results.

Bagging and Boosting¶


Bagging is a technique for reducing prediction variance: it generates additional training sets by resampling the original data with replacement, trains a model on each, and combines their predictions by averaging or voting.
Boosting is an iterative strategy that adjusts each observation's weight based on the previous model's classification, so that later models concentrate on the examples earlier models got wrong.
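The contrast can be sketched with scikit-learn's generic implementations of both ideas. This is a minimal illustration on a synthetic dataset; the dataset, estimator counts, and random seeds are arbitrary choices, not part of the original notebook.

```python
# Bagging vs. boosting on a toy classification problem (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging: each tree is fit on an independent bootstrap sample (variance reduction).
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Boosting: trees are fit sequentially, re-weighting misclassified observations.
boost = AdaBoostClassifier(n_estimators=50, random_state=0)

for name, model in [("bagging", bag), ("boosting", boost)]:
    model.fit(X_tr, y_tr)
    print(name, round(model.score(X_te, y_te), 3))
```

A random forest is essentially the bagging side of this picture, specialized to decision trees with extra per-split randomness.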

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees¶

(Figures: two trees from the same random forest.) The figures above show how different trees in a random forest can give us different results.

Import Library¶
In [75]:
from sklearn.datasets import load_breast_cancer
from sklearn import tree
from matplotlib import pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')

Load the Dataset¶

In [76]:
# note: the variable is named `iris`, but this is the breast cancer dataset
iris = load_breast_cancer()

X = iris.data
y = iris.target

Train/Test Split¶

In [77]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=0)

Randomness in Random Forest¶

Getting Data for Each Tree: Bootstrapping¶

All of the decision trees in a random forest use a slightly different set of data. They might be similar, but they are not the same. The final result is based on the votes from all the decision trees, so anomalies tend to get smoothed over: the data causing an anomaly appears in some of the trees but not all of them, while more typical data appears in most if not all of the trees.

Each tree's data set is generated by drawing a random sample from all of the available data, with replacement. This technique is known as bootstrapping. Each tree's sample is the same size as the original data set.
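Bootstrapping itself is just sampling with replacement, which can be sketched in a few lines of NumPy (the tiny dataset here is illustrative):

```python
# Bootstrapping sketch: draw n rows with replacement, so each "tree" gets a
# dataset the same size as the original, with some rows repeated and others
# left out entirely (the out-of-bag rows).
import numpy as np

rng = np.random.default_rng(0)
n = 10
data = np.arange(n)  # stand-in for the training rows

sample_idx = rng.integers(0, n, size=n)  # n indices drawn with replacement
bootstrap = data[sample_idx]
oob = np.setdiff1d(data, bootstrap)      # rows this tree never sees

print("bootstrap sample:", bootstrap)
print("out-of-bag rows: ", oob)
```

On average, each bootstrap sample leaves out roughly a third of the rows; those left-out rows are exactly what the out-of-bag error below is computed on.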

Criteria Selection¶

By default, a random forest uses the square root of the number of features as the maximum number of features it will consider at any given split, and it picks the best split possible among those candidates. If absolutely no improvement is available using those candidates, it will continue evaluating additional features until it finds a useful split.

In the lines below I first train two random forest models, one with the out-of-bag score enabled and one without; since this is a small dataset, there is no notable difference between the accuracies of the two models.

Out-of-Bag Error:¶

Bagging uses subsampling with replacement to create the training samples the model learns from. The random forest classifier is trained using bootstrap aggregation, where each new tree is fit on a bootstrap sample of the training observations. The out-of-bag (OOB) error for each observation i is the average error of predictions from the trees that did not include i in their bootstrap sample. This allows the RandomForestClassifier to be validated while it is being trained.


Sounds like cross-validation?!¶

Out-of-bag error and cross-validation (CV) are different methods of measuring the error estimate of a machine learning model. Over many iterations, the two methods should produce a very similar error estimate. That is, once the OOB error stabilizes, it will converge to the cross-validation (specifically leave-one-out cross-validation) error. The advantage of the OOB method is that it requires less computation and allows one to test the model as it is being trained.

However, there are a few downsides to cross-validation. Setting aside some data means you train your model on only a subset of the data. If you have a small quantity of data, setting some aside can have a large impact on the results.
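The claimed agreement between the two estimates can be checked on the notebook's dataset. This is a sketch with an arbitrary seed and tree count; with enough trees the OOB score and a 10-fold CV score should land close together:

```python
# Comparing the OOB estimate to a cross-validated score on the same data.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
clf.fit(X, y)  # OOB score is computed as a side effect of training

cv_score = np.mean(cross_val_score(clf, X, y, cv=10))
print("OOB score :", round(clf.oob_score_, 3))
print("10-fold CV:", round(cv_score, 3))
```

Note that the OOB estimate came "for free" from a single fit, while the CV estimate required ten additional fits.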

In [79]:
rf = RandomForestClassifier(n_estimators=100, oob_score=True, n_jobs=1)
rf1 = RandomForestClassifier(n_estimators=100, oob_score=False, n_jobs=1)
In [80]:
rf.fit(x_train, y_train)
a = rf.predict(x_train)
b = rf.predict(x_test)
In [81]:
rf1.fit(x_train, y_train)
c = rf1.predict(x_train)
d = rf1.predict(x_test)
In [94]:
fn=iris.feature_names
cn=iris.target_names
fig, axes = plt.subplots(nrows = 1,ncols = 1,figsize = (15,10), dpi=800)
tree.plot_tree(rf.estimators_[0],
               feature_names = fn, 
               class_names=cn,
               filled = True);
fig.savefig('rf_individualtree.png')
In [93]:
# This may not be the best way to view each estimator, as each plot is small
fn=iris.feature_names
cn=iris.target_names
fig, axes = plt.subplots(nrows = 1,ncols = 5,figsize = (10,2), dpi=900)
for index in range(0, 5):
    tree.plot_tree(rf.estimators_[index],
                   feature_names = fn, 
                   class_names=cn,
                   filled = True,
                   ax = axes[index]);

    axes[index].set_title('Estimator: ' + str(index), fontsize = 11)
fig.savefig('rf_5trees.png')

Accuracy¶

In [82]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,b )
Out[82]:
0.9385964912280702
In [83]:
accuracy_score(y_train, a)
Out[83]:
1.0
In [68]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,d )
Out[68]:
0.9385964912280702

CrossValidation¶

In [74]:
# 10-Fold Cross validation
print (np.mean(cross_val_score(rf, x_train, y_train, cv=10)))
0.9531932773109244
In [84]:
param_grid = {
                 'n_estimators': [5, 10, 15, 20,100],
                 'max_depth': [2,4, 5, 7, 9]
             }

grid_clf = GridSearchCV(rf, param_grid, cv=10)
grid_clf.fit(x_train, y_train)
Out[84]:
GridSearchCV(cv=10, estimator=RandomForestClassifier(n_jobs=1, oob_score=True),
             param_grid={'max_depth': [2, 4, 5, 7, 9],
                         'n_estimators': [5, 10, 15, 20, 100]})

Now we can get the best model using grid_clf.best_estimator_ and the best parameters using grid_clf.best_params_. Similarly, we can get the grid scores using grid_clf.cv_results_.

In [85]:
grid_clf.best_estimator_
Out[85]:
RandomForestClassifier(max_depth=5, n_estimators=15, n_jobs=1, oob_score=True)
In [86]:
grid_clf.best_params_
Out[86]:
{'max_depth': 5, 'n_estimators': 15}
In [34]:
#grid_clf.cv_results_
In [89]:
rf2 = RandomForestClassifier(max_depth=5, n_estimators=15, n_jobs=1, oob_score=True)
rf2.fit(x_train,y_train)
e = rf2.predict(x_train)
f = rf2.predict(x_test)
In [91]:
accuracy_score(y_test, f)
Out[91]:
0.9517543859649122

Thus we have improved our model using cross-validation.¶

Limitation¶

The main limitation of random forests is that a large number of trees can make the algorithm too slow for real-time prediction. In general, these models are fast to train but comparatively slow to produce predictions once trained, since every tree must be evaluated for each prediction.
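The effect is easy to see by timing prediction at two different forest sizes; this is a rough sketch (absolute timings will vary by machine), using the same dataset as above:

```python
# Prediction time grows with the number of trees, since every tree must be
# traversed for each prediction.
import time
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

for n in (10, 500):
    clf = RandomForestClassifier(n_estimators=n, random_state=0).fit(X, y)
    t0 = time.perf_counter()
    clf.predict(X)
    print(f"{n:>4} trees: {time.perf_counter() - t0:.4f} s to predict")
```

When prediction latency matters, pruning the forest (fewer, shallower trees, as found by the grid search above) is a common mitigation.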